Skip to content

Flink: Add decimal write/read roundtrip test for FlinkParquetReaders#16346

Merged
pvary merged 3 commits into
apache:mainfrom
wombatu-kun:flink-parquet-decimal-roundtrip-test
May 18, 2026
Merged

Flink: Add decimal write/read roundtrip test for FlinkParquetReaders#16346
pvary merged 3 commits into
apache:mainfrom
wombatu-kun:flink-parquet-decimal-roundtrip-test

Conversation

@wombatu-kun
Copy link
Copy Markdown
Contributor

@wombatu-kun wombatu-kun commented May 15, 2026

Summary

  • Address the longstanding TODO in FlinkParquetReaders.BinaryDecimalReader by adding a Flink-writer → Flink-reader roundtrip test for decimals across precisions 9/2, 15/3 and 38/10 (covering INT32/INT64/FIXED_LEN_BYTE_ARRAY physical encodings).
  • Test added to existing TestFlinkParquetReader in v1.20, v2.0 and v2.1; the TODO comment is removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the flink label May 15, 2026
@pvary
Copy link
Copy Markdown
Contributor

pvary commented May 15, 2026

Should we test this in the BaseFormatModelTests?

…odelTests

Per review feedback on apache#16346, instead of a Flink-only decimal Parquet test, add a StructWithDecimals generator to DataGenerators.ALL and a testDataWriterEngineWriteEngineRead case to BaseFormatModelTests. This exercises decimal(9,2)/(15,3)/(38,10) write-read roundtrips (Parquet INT32/INT64/FIXED_LEN_BYTE_ARRAY encodings) for Flink and Spark across Avro, Parquet and ORC, and closes the missing engine-write -> engine-read symmetry in the shared harness. The duplicated TestFlinkParquetReader decimal test is removed from Flink v1.20/v2.0/v2.1; the FlinkParquetReaders TODO stays resolved by the broader shared coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the data label May 15, 2026
@wombatu-kun
Copy link
Copy Markdown
Contributor Author

wombatu-kun commented May 15, 2026

Good call, thanks @pvary — done in 6d2f777.

Rather than a Flink-only test I added a Decimals generator (decimal(9,2) / (15,3) / (38,10), covering the Parquet INT32 / INT64 / FIXED_LEN_BYTE_ARRAY encodings) to DataGenerators.ALL, so decimals now flow through the whole BaseFormatModelTests matrix for Flink and Spark across Avro/Parquet/ORC.

One gap I hit while doing this: the generator-parameterized cases were only engineWrite→genericRead and genericWrite→engineRead, so the engine-writer→engine-reader roundtrip the original TODO asked for ("write-read-validate decimal via FlinkParquetWrite/Reader") wasn't actually covered by any single case. I added a testDataWriterEngineWriteEngineRead case to BaseFormatModelTests to close that symmetry — for all types/engines/formats, not just decimals.

The duplicated TestFlinkParquetReader decimal test is removed from v1.20/v2.0/v2.1; the FlinkParquetReaders TODO stays resolved by the broader shared coverage. Verified green: TestFlinkFormatModel (Flink 1.20/2.0/2.1) and TestSparkFormatModel (Spark 4.1).

}
}

static class StructWithDecimals implements DataGenerator {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is Decimals

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — there's no nested struct here, so StructWithDecimals was a misleading name. Renamed the class to Decimals (and updated the ALL array) in 5dc4b5d.

static class StructWithDecimals implements DataGenerator {
private final Schema schema =
new Schema(
required(1, "row_id", Types.StringType.get()),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No — row_id isn't needed for the decimal write/read roundtrip and the generator has no identity semantics. Dropped the column and renumbered the decimal field IDs to 1/2/3 in 5dc4b5d.

…d column

The generator has no nested struct, so StructWithDecimals was a misleading
name; rename it to Decimals. The row_id String column is not needed for the
decimal write/read roundtrip and carries no identity semantics, so drop it
and renumber the decimal field IDs to 1/2/3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wombatu-kun
Copy link
Copy Markdown
Contributor Author

Thank you for review!

@pvary pvary merged commit ae4dfea into apache:main May 18, 2026
36 checks passed
@pvary
Copy link
Copy Markdown
Contributor

pvary commented May 18, 2026

Merged to main.
Thanks @wombatu-kun for the additional test!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants